RNA seq analysis of the St Louis cohort

I first read in the raw data:

countData<-read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/Local_cohort/RNAseq/Global_table_StLouis_completed.xlsx")
colnames(countData) <- as.character(countData[1,]) # set patient names as colnames
rownames(countData) <- countData$ID # set gene names as rownames
countData_withinfo <- countData[,-1]

I can then look at the extra information on the patients’ unmapped genes, lib size, …

## [1] "D708"

## [1] "D708"
## [1] 60433    80

It seems like the patient “D708” might be a problematic sample, I keep it in mind for the rest of the analysis.

I then remove the genes that are not present in both cohorts:

All the genes are common to both cohorts, I don’t need to remove any of them.

I then load in the clinical data, and reorder the RNAseq and clinical data tables so that they are in the same order:

colData<-read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/Local_cohort/Data synthesis local cohort Saint-Louis 17012019.xlsx")
colData <- colData[!is.na(colData$RNAseq.ID),] # remove rows with no RNAseq sample
rownames(colData) <- colData$Id.Cryostem.R
colnames(colData)[which(colnames(colData)=="RNA.INTEGRITY.NUMBER.(RIN)")] <- "RNA.INTEGRITY"

### Reorder
countData<-countData[,rownames(colData)]

I then identify the donors and recipients, for which we have complete cases:

complete_couples <- colData$COUPLENUMBER[which(duplicated(colData$COUPLENUMBER))]

complete_names <- rownames(colData)[which(colData$COUPLENUMBER %in% complete_couples)]
donor_names <- complete_names[grep("D", complete_names)]
donor_id <- which(colnames(countData) %in% donor_names)

recip_names <- complete_names[grep("R", complete_names)]
recip_id <- which(colnames(countData) %in% recip_names)

Finally, I generate the age and gender vectors that will be used for modeling in all patients (donors, recipients and couples):

## [1] "F"
## [1] "M"
## [1] 8264
##       COUPLENUMBER GENDER        DOG        DOB              GROUP
## D1502            3      F       <NA> 1990-04-03   Primary_tolerant
## R1503            3      M 2014-01-02 1991-05-19   Primary_tolerant
## D2031            6      M       <NA> 1958-03-24 Secondary_tolerant
## R2032            6      M 2014-04-22 1971-04-27 Secondary_tolerant
## D369             7      M       <NA> 1994-07-10 Secondary_tolerant
## R370             7      M 2013-04-23 1996-04-13 Secondary_tolerant
##       RNA.INTEGRITY CMVStatus gender_comp age_recip
## D1502           7.4         1          FM      8264
## R1503           8.4         1          FM      8264
## D2031           7.2         0          MM     15701
## R2032           8.4         0          MM     15701
## D369            7.8         0          MM      6219
## R370            7.4         0          MM      6219

Analysis on the recipients

Preprocessing :

I will first focus on the recipients only, and perform a standard preprocessing (keep genes with more than one count per million in at least three patients):

The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:

Perform normalisation and initiate the design matrix (per group)

I save the produced objects:

Load the produced data :

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/expression_table_recip.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/data_for_DE_recip.RData")

Differential Analysis

Fit a model to the design:

##                     Contrasts
## Levels               group1 group2 group3
##   non_tolerant            1      1      0
##   secondary_tolerant      0     -1     -1
##   primary_tolerant       -1      0      1

Adjust the pvalues for multiple testing :

## [1] "Non tolerant vs primary: 226"
## [1] "Non tolerant vs secondary: 117"
## [1] "Primary vs secondary: 0"

Plot a triwise plot and save an interactive triwise plot: It seems like the majority of differentially expressed genes are being overexpressed in tolerant recipients.

## [1] 282

Machine learning approach :

We first filter out lowly variable genes:

Prepare the data:

We perform a PCA on the recipients to see if we already observe some structure in the data, and if there isn’t too much batch effect: On the PCA, some tolerant recipients seem to be quite separated from the rest of the patients (more on the left). They correspond to the high CMV patients that we had already identified as quite different in the CyTOF data

We can try to build a random forest model using all the genes of the recipients:

##                     predicted
## true                 non_tolerant primary_tolerant secondary_tolerant
##   non_tolerant                 17                0                  1
##   primary_tolerant              4                1                  2
##   secondary_tolerant            3                3                  1
## [1] 40.625

But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:

First, we try to separate tolerant from non tolerant recipients:

set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 28.12%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           14        4   0.2222222
## tolerant                5        9   0.3571429

Second, we can try to separate primary tolerant from secondary tolerant recipients:

set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
## 
## Call:
##  randomForest(formula = group ~ ., data = df_tol, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 85.71%
## Confusion matrix:
##                    primary_tolerant secondary_tolerant class.error
## primary_tolerant                  1                  6   0.8571429
## secondary_tolerant                6                  1   0.8571429

But it seems like we can’t classify between these two groups of patients yet.

Feature selection

Identifying features that vary the most between non and tolerant recipients:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the non vs tolerance:

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/norm_data_TNT.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

We isolate the genes that were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## Note: zip::zip() is deprecated, please use zip::zipr() instead
## [1] "1876 genes were selected with a 0.95 threshold in the St Louis cohort."

We can see if the features that we selected with feature selection correspond to the genes that had been found through differential analysis:

## [1] "254 out of the 282 genes that were identified as differentially expressed were retrieved with our feature selection method."

We can then see which of these selected genes were common with the Cryostem cohort:

sel_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "140 genes were common between the two cohorts."
#selected_genes_qt[sel_common]

Graph of the 140 selected genes that were found commonly between the two cohorts:

I save the 140 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)

We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

Graph of the 111 selected genes that were found commonly between the two cohorts:

I save these features:

Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:

##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                2       12
## [1] "prediction error: 18.75"

We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 111 selected features isn’t better than the one built on the 140 common genes:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.75%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           15        3   0.1666667
## tolerant                3       11   0.2142857

Threshold of 0.90

We isolate the genes that were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:

qt <- unlist(map(perm_vals, 1))
densityplot(qt)

selected_genes_qt <- colnames(norm_data)[which(qt>.90)+1]
write.xlsx(selected_genes_qt, file = "../data/RNAseq/recip_only/perm_val_TNT_90_thresh.xlsx")

We can see if the features that we selected with feature selection correspond to the genes that had been found through differential analysis:

## [1] "264 out of the 282 genes that were identified as differentially expressed were retrieved with our feature selection method."

We can then see which of these selected genes are common with the Cryostem cohort:

## [1] "408 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

I save the 408 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)

We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

Graph of the 278 selected genes that had the same behaviour in the two cohorts:

I save the 278 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)

Now that we have selected features that should help us classify tolerant and non tolerant recipients, we can see that a RF model built on the 408 selected features is slightly better than the one built on all genes:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.62%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           15        3   0.1666667
## tolerant                2       12   0.1428571

We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 278 features. But the RF model built on the 278 selected features isn’t better than the one built on the 408 common genes:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.75%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           15        3   0.1666667
## tolerant                3       11   0.2142857

Including age and gender in the analysis

Of the 408 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

Some genes (end of the second “bump” in the density plot) seem to still add information on the tolerance even after correction for gender and age:

names(qt_sel)[which(qt_sel>0.8)]
##  [1] "ILDR2"           "ENSG00000242861" "ENSG00000272735"
##  [4] "VIL1"            "RNF212"          "CLNK"           
##  [7] "PARM1"           "CAMK4"           "FLT4"           
## [10] "HIST1H4A"        "TREML1"          "RAB39A"         
## [13] "AGAP12P"         "ENSG00000229668" "ENSG00000273824"
## [16] "ENSG00000278013" "ENSG00000278909" "ITGA3"          
## [19] "ENGASE"          "NPTX1"           "MAP1LC3A"       
## [22] "MAFB"            "SLC27A1"         "THAP7.AS1"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]

Which of the 408 selected genes were most linked to gender?

## [1] "LCN2"            "NTNG2"           "KCNJ2"           "TGM2"           
## [5] "RPS9"            "ENSG00000211459"

Boxplots of the most gender related genes:

Which of the 408 selected genes were most linked to age?

## [1]  54  82 164 256

We can plot the genes that seemed to be most linked to the age of the recipients:

Identifying features that vary the most between primary and secondary tolerant recipients:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the non vs tolerance:

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/norm_data_PS.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## [1] "343 genes were above the 0.95 threshold in the St Louis cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "3 genes were common between the two cohorts: "
## [1] "ENSG00000237818" "KLRC2"           "ENSG00000251876"

I save the 3 features in an excel file: The excel file contains, for every gene, the associated logFC, pvalue…, where the comparison is in primary versus secondary tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in primary tolerant patients)

We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

I save this feature:

Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                3                  4
## [1] "prediction error: 35.7142857142857"

Threshold of 0.90

These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:

## [1] "723 genes were above the 0.95 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

sel_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "45 genes were common between the two cohorts."
# selected_genes_qt[sel_common]

Graph of these selected genes that were found commonly between the two cohorts:

I save the 45 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)

We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

Graph of the 24 genes that were found commonly in the two cohorts AND had the same behaviour in the two cohorts:

sel_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_PS_90_thresh.xlsx")
selb <- as.character(sel_beh$sel_beh)
correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

# recip_path <- "~/Desktop/"
# 
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

# dev.off()

I save the 24 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)

Now that we have selected features that should help us classify tolerant and non tolerant recipients, we can see that a RF model built on these selected features is slightly better than the one built on all genes:

set.seed(1)
rf <- ranger::ranger(group~., data = rna_recip_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  6                  1
##   secondary_tolerant                2                  5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 21.4285714285714"

We can also try to use only the features that were common and behaved similarly in the two cohorts (24 features) . But the RF model built on the 24 selected features isn’t better than the one built on the 45 common genes:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
## [1] "prediction error: 28.5714285714286"

Including age and gender in the analysis

Of the 45 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

A few of the selected genes (second “bump” in the density plot) still seem to add information on the tolerance even after correction for gender and age:

## [1] "HIST1H2AC"       "ENSG00000280981" "GIMAP5"          "KLRC2"          
## [5] "ENSG00000257246" "TOP2A"           "ENSG00000213279"

Which of the 45 selected genes were most linked to gender?

## [1] "ENSG00000237818" "GIMAP5"          "LRRC6"           "KLRC2"          
## [5] "FAM83D"          "ENSG00000267838"

Boxplots of these most gender related genes:

Which of the 45 selected genes were most linked to age?

## [1] 34 40

We can plot the genes that seemed to be most linked to the age of the recipients:

Analysis on the donors only :

The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:

Perform normalisation and initiate the design matrix (per group)

Save the produced objects:

Load the produced data :

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/expression_table_donors.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/data_for_DE_donors.RData")

Differential Analysis

Fit a model to the design:

##                     Contrasts
## Levels               group1 group2 group3
##   non_tolerant            1      1      0
##   secondary_tolerant      0     -1     -1
##   primary_tolerant       -1      0      1

Adjust the pvalues for multiple testing :

## [1] "Non tolerant vs primary: 0"
## [1] "Non tolerant vs secondary: 0"
## [1] "Primary vs secondary: 0"

Machine learning approach :

filter out lowly variable genes:

Prepare the data:

We perform a PCA on the donors to see if we already observe some structure in the data, and if there isn’t too much batch effect: We don’t observe much structure, relatively to the tolerance groups, or batches.

We can try to build a random forest model using all the genes of the donors:

##                     predicted
## true                 non_tolerant primary_tolerant secondary_tolerant
##   non_tolerant                 16                2                  0
##   primary_tolerant              5                2                  0
##   secondary_tolerant            6                1                  0
## [1] 43.75

But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:

First, we try to separate tolerant from non tolerant donors:

set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 34.38%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           13        5   0.2777778
## tolerant                6        8   0.4285714

Second, we can try to classify primary tolerant from secondary tolerant donors:

set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
## 
## Call:
##  randomForest(formula = group ~ ., data = df_tol, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 42.86%
## Confusion matrix:
##                    primary_tolerant secondary_tolerant class.error
## primary_tolerant                  4                  3   0.4285714
## secondary_tolerant                3                  4   0.4285714

But the error rates are still very high, even after splitting the classification problem in two.

Feature selection

Identifying features that vary the most between non and tolerant recipients:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the non vs tolerance:

norm_data$group <- as.character(norm_data$group)
norm_data$group[which(norm_data$group!="non_tolerant")] <- "tolerant"
norm_data$group <- as.factor(norm_data$group)

perm_vals_PRISM(norm_data = norm_data, PRISM_name = "rnaseq_d_TNT",
                rds_name = "handle_louis_rna_d_TNT.rds")

handle3 <- readRDS("handle_louis_rna_d_TNT.rds")
perm_vals <- qsub_retrieve(handle3)

save(perm_vals, file = "~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_TNT.RData")
save(norm_data, file= "~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_TNT.RData")

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_TNT.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## [1] "1312 genes were abonve the 0.95 threshold in the St Louis cohort."

Which of these selected genes are common with the Cryostem cohort?

sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "70 genes were common between the two cohorts."
# selected_genes_qt[sel_common]

Graph of these selected genes that were found commonly between the two cohorts:

I save these features:

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

I save the 5 features that had the same behaviouir in the 2 cohorts in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolerant patients)

Now that we have selected features that should help us classify non and tolerant donorients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:

set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           13        5
##   tolerant                5        9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 31.25"

We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 5 selected features isn’t better than the one built on the 70 common genes:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 46.88%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           10        8   0.4444444
## tolerant                7        7   0.5000000

Threshold of 0.90

These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:

## [1] "2592 genes were abonve the 0.90 threshold in the St Louis cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "310 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

correlations <- norm_data[,selected_genes_qt[sel_common]] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#0000FF80"
  }
}

# recip_path <- "~/Desktop/"
# 
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

# dev.off()

I save these features:

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)

Graph of these 22 selected genes that were found commonly between the two cohorts:

I save these features:

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:

set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                5        9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"

We can do the same on the 22 genes that had the same behaviour in the two cohorts, but it doesn’t return better results:

set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_beh_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           10        8
##   tolerant               11        3
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 59.375"

Including age and gender in the analysis

Of the 310 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

A few genes (tale of the second “bump” in the density plot) still add information on the tolerance even after correction for gender and age:

##  [1] "ZSCAN20"         "ENSG00000230896" "TTC22"          
##  [4] "ENSG00000189195" "LOC653513"       "ENSG00000201076"
##  [7] "ZNF2"            "PROB1"           "NIPAL4"         
## [10] "ACOT13"          "ZKSCAN4"         "KHDC1"          
## [13] "ENSG00000271789" "GIMAP8"          "ENSG00000253372"
## [16] "C9orf64"         "ENSG00000255186" "UQCC3"          
## [19] "ENSG00000138161" "ENSG00000255776" "ENSG00000256937"
## [22] "CD69"            "GPR68"           "ENSG00000258875"
## [25] "ENSG00000277999" "ENSG00000279294" "ENSG00000267074"
## [28] "TBCD"            "SNORD17"         "ENSG00000273893"
## [31] "ZSWIM9"          "ETFB"            "ZNF613"         
## [34] "ENSG00000253590" "ADORA2A.AS1"     "LSS"            
## [37] "MCM3AP.AS1"

Which of the 310 selected genes were most linked to gender?

## [1] "TCF7"  "VENTX"

Boxplots of these most gender related genes:

Which of the 310 selected genes were most linked to age?

## [1]  99 111 159

We can plot the genes that seemed to be most linked to the age of the recipients:

Identifying features that vary the most between primary and secondary tolerant donors:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the non vs tolerance:

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_PS.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## [1] "810 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "26 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 6 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_95_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  6                  1
##   secondary_tolerant                1                  6
## [1] "prediction error: 14.2857142857143"

We can also build a RF model using only the 6 genes that had the same behaviour in both cohorts, but the classification resutls are slightly worse:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
## [1] "prediction error: 28.5714285714286"

Threshold of 0.90

These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:

## [1] "1498 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "104 genes were common between the two cohorts."
# selected_genes_qt[sel_common]

Graph of these selected genes that were found commonly between the two cohorts:

correlations <- norm_data[,selected_genes_qt[sel_common]] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

# recip_path <- "~/Desktop/"
# 
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

# dev.off()

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 6 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_90_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:

rna_donors_louis_PS_90 <- readRDS("../data/RNAseq/donors_only/rna_donor_louis_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donors_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                1                  6
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 21.4285714285714"

We can also build a RF only on the 16 genes that had the same behaviour in both cohorts, but this makes the classification slightly worse:

rna_donors_louis_beh_PS_90 <- readRDS("../data/RNAseq/donors_only/rna_donor_louis_beh_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donors_louis_beh_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"

Including age and gender in the analysis

Of the 104 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

A few genes (tale of the second “bump” in the density plot) still add information on the tolerance even after correction for gender and age:

## [1] "MTRNR2L2"        "ENSG00000228166" "ENSG00000235641" "ENSG00000211801"
## [5] "ENSG00000211802"

Which of the 104 selected genes were most linked to gender?

## [1] "ENSG00000260398" "LRRC6"           "UTY"             "KDM5D"

Boxplots of these most gender related genes:

Which of the 104 selected genes were most linked to age?

We can plot the genes that seemed to be most linked to the age of the recipients:

Analysis on the donors - recipients

## [1] 18672    64

The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:

Perform normalisation and initiate the design matrix (per group)

Save the produced objects:

Load the data :

Machine learning approach :

filter out lowly variable genes:

Prepare the data:

We can apply a principal components analysis to see if there is some structure in the data that can already be seen:

In the following PCA plot, we plot the recipients as black dots, the donors as white dots, and we link donors and recipients that belong to the same couple by a line of the color of their group.

It looks like, in the RNAseq data of the St Louis cohort, the distances between non tolerant donors and recipients are not really bigger from the distances between primary or secondary tolerant donors and recipients.

We can already try to classify the couples based on all the genes:

##                     predicted
## true                 non_tolerant primary_tolerant secondary_tolerant
##   non_tolerant                 16                0                  2
##   primary_tolerant              5                0                  2
##   secondary_tolerant            4                2                  1
## [1] 0.46875

Trying to classify between non tolerant and tolerant couples:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tmp, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 34.38%
## Confusion matrix:
##              non_tolerant tolerant class.error
## non_tolerant           15        3   0.1666667
## tolerant                8        6   0.5714286

Trying to classify only primary and seconday tolerant couples:

## 
## Call:
##  randomForest(formula = group ~ ., data = df_tol, mtry = mtry,      ntree = ntree, importance = T, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 64.29%
## Confusion matrix:
##                    primary_tolerant secondary_tolerant class.error
## primary_tolerant                  3                  4   0.5714286
## secondary_tolerant                5                  2   0.7142857

Feature selection

Investigating into metabolites that mainly differ between tolerant and non tolerant couples:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the non vs tolerance:

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/norm_data_TNT.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## [1] "1406 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "68 genes were common between the two cohorts."
# selected_genes_qt[sel_common]

Graph of these selected genes that were found commonly between the two cohorts:

correlations <- norm_data[,selected_genes_qt[sel_common]] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#0000FF80"
  }
}

# recip_path <- "~/Desktop/"
# 
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

# dev.off()

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 6 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_95_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["non_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#FF000080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is giving similar results to the one built on all genes:

rna_dr_louis_TNT_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_TNT_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                7        7
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 34.375"

We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which slightly improves the classification:

rna_dr_louis_beh_TNT_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_beh_TNT_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_beh_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           16        2
##   tolerant                6        8
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 25"

Threshold of 0.90

These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:

## [1] "2216 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "196 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

correlations <- norm_data[,selected_genes_qt[sel_common]] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#0000FF80"
  }
}

# recip_path <- "~/Desktop/"
# 
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

# dev.off()

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 87 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_90_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["non_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#FF000080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is giving similar results to the one built on all genes:

rna_dr_louis_TNT_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_TNT_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           15        3
##   tolerant                6        8
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"

We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which doesn’t improve the classification:

rna_dr_louis_beh_TNT_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_beh_TNT_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_beh_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                5        9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"

Including age and gender in the analysis

Of the 196 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

A few genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:

names(qt_sel)[which(qt_sel>0.7)]
## [1] "subst_MIR5581"         "subst_ENSG00000272282" "subst_EOMES"          
## [4] "subst_DTHD1"           "subst_SLC4A4"          "subst_ENSG00000255186"
## [7] "subst_FOXJ1"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]

Which of the 196 selected genes were most linked to gender?

## [1] "subst_GZMK"            "subst_ENSG00000248568" "subst_ENSG00000226149"
## [4] "subst_ENSG00000260528" "subst_LOC102724545"

Boxplots of these most gender related genes:

Which of the 196 selected genes were most linked to age?

We can plot the genes that seemed to be most linked to the age of the recipients:

Identifying features that mainly differ between primary and secondary tolerant couples:

Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.

And then compute the permutation distribution for the primary vs secondary tolerance:

Compute the median per group for every gene, to use it later in the graphs:

load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/norm_data_PS.RData")

gr_medians <- norm_data %>%
  group_by(group) %>%
  summarize_if(is.numeric, median) %>%
  as.data.frame()
rownames(gr_medians) <- gr_medians$group

Threshold of 0.95

These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:

## [1] "627 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "111 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

colData_orig <- colData
# rownames(norm_data) are couple numbers -> I change the rownames of colData 
# to couples as well so that I can keep using my functions
colData <- colData_orig
colData <- colData %>% 
  dplyr::filter(grepl("R", rownames(colData))) %>% 
  mutate(COUPLE = paste0("couple_", COUPLENUMBER))
rownames(colData) <- colData$COUPLE

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 6 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_95_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:

rna_dr_louis_PS_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_PS_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_PS_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"

We can also build a RF only on the 6 genes that had the same behaviour in both cohorts, but this doesn’t make the classification better or worse:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
## [1] "prediction error: 28.5714285714286"

Threshold of 0.90

These genes were selected as informative to classify the non and tolerant couples, with a 0.90 threshold:

## [1] "1192 genes were above the 0.90 threshold in the Cryostem cohort."

Which of these selected genes are common with the Cryostem cohort?

## [1] "337 genes were common between the two cohorts."

Graph of these selected genes that were found commonly between the two cohorts:

We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:

Graph of the 15 features that had the same behaviour in both cohorts:

sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_90_thresh.xlsx")
selb <- sel_beh$sel_beh

correlations <- norm_data[,selb] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>% 
  dplyr::filter(var1 < var2)

correlated_features <- correlations %>% 
  dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))

medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
  if (medians["primary_tolerant", gene] == max(medians[, gene])){
    color_nodes[gene] <- "#00FF0080"
  }
}

set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
     vertex.color = color_nodes, layout = layout_nicely)

Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:

rna_dr_louis_PS_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"

We can also build a RF only on the 15 genes that had the same behaviour in both cohorts, but this doesn’t make the classification better or worse:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
## [1] "prediction error: 28.5714285714286"

Including age and gender in the analysis

Of the 337 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?

A few genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:

names(qt_sel)[which(qt_sel>0.7)]
##  [1] "subst_MFSD2B"          "subst_C2orf88"        
##  [3] "subst_ENSG00000269982" "subst_LOC105374344"   
##  [5] "subst_MIR4458HG"       "subst_THBS4"          
##  [7] "subst_EFNA5"           "subst_LOC100506098"   
##  [9] "subst_EPHA1"           "subst_ZNF425"         
## [11] "subst_GIMAP5"          "subst_RPA4"           
## [13] "subst_TSC22D3"         "subst_ENSG00000237359"
## [15] "subst_PSAT1"           "subst_ENSG00000256723"
## [17] "subst_ENSG00000258181" "subst_OGFOD2"         
## [19] "subst_ENSG00000211858" "subst_ENSG00000211886"
## [21] "subst_LINC02325"       "subst_LINC01550"      
## [23] "subst_SNORD116.18"     "subst_ENSG00000266378"
## [25] "subst_TUBB6"           "subst_ENSG00000273669"
## [27] "subst_ZNF613"          "subst_ENSG00000240591"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]

Which of the 337 selected genes were most linked to gender?

Boxplots of these most gender related genes:

Which of the 337 selected genes were most linked to age?

We can plot the genes that seemed to be most linked to the age of the recipients:

Combining the information on couples, recipients and donors:

Now that we have selected features at the couple level, at the recipient level and at the donor level, we can try to combine these features:

Classifying non versus tolerant patients based on all the information:

Threshold of 0.95:

We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The 2 first components of the PCA seem to hold quite a lot of the variability in the data:

plot(pca_all)

The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                4       10
## [1] "prediction error: 25"

It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:

##           R_NELL2           R_NPAS2           R_KLHL3 R_ENSG00000261685 
##         0.6181289         0.5906052         0.4239539         0.4177723 
## R_ENSG00000242861          R_DCBLD2 R_ENSG00000280123            R_C1QA 
##         0.2813899         0.2692035         0.2606036         0.2459887 
##           D_DIP2A          R_RASSF6 R_ENSG00000144642          R_PCYT1B 
##         0.2368956         0.2179848         0.2154340         0.1931664 
## R_ENSG00000261488            R_LCN2      R_LACTB2.AS1           R_FRMD3 
##         0.1788241         0.1645321         0.1639191         0.1590817 
##   subst_LINC00482 R_ENSG00000201367 D_ENSG00000269246      subst_FCER1A 
##         0.1528899         0.1441502         0.1438567         0.1426199

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "KLRC4"
## [1] "Genes that were commonly selected in recipients and couples: "
##  [1] "FCER1A"          "DYSF"            "RASSF6"         
##  [4] "ETV7"            "LACTB2.AS1"      "ANKRD22"        
##  [7] "IFNG.AS1"        "SLC41A2"         "ENSG00000267279"
## [10] "TGM2"            "SIGLEC16"        "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)

Threshold of 0.90

We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The 2 first components of the PCA seem to hold quite a lot of the variability in the data:

plot(pca_all)

The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##               predicted
## true           non_tolerant tolerant
##   non_tolerant           14        4
##   tolerant                5        9
## [1] "prediction error: 28.125"

It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:

##               R_NELL2           R_LINC01259     R_ENSG00000261685 
##            0.28448380            0.24052458            0.23531343 
##              R_SRGAP1               R_NPAS2               R_SPON1 
##            0.23319681            0.21515347            0.20535058 
##                R_C1QB             R_ADAMTS2                 R_PSD 
##            0.15160904            0.14723874            0.14440015 
##         subst_ADAMTS2     R_ENSG00000242861              R_AMIGO1 
##            0.13739458            0.12927273            0.12869965 
##     R_ENSG00000261488 subst_ENSG00000222179              R_DCBLD2 
##            0.12796713            0.12422006            0.11556624 
##              R_PCYT1B      subst_LACTB2.AS1               R_P2RY1 
##            0.11124854            0.10768478            0.10277556 
##               D_DIP2A                 R_MCC 
##            0.10209030            0.09945891

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
##  [1] "NRCAM"           "FAM131B"         "AFF2"           
##  [4] "ZBTB10"          "LOC171391"       "NUAK1"          
##  [7] "GPR183"          "ENSG00000278013" "HAR1A"          
## [10] "ENSG00000269246"
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "CDS1"            "PCDHGB3"         "ENSG00000255186" "LOC283177"      
## [5] "ENSG00000255819" "KLRC4"           "ENSG00000267731" "KIAA1671"
## [1] "Genes that were commonly selected in recipients and couples: "
##  [1] "C1QC"            "ENSG00000233008" "GBP1P1"         
##  [4] "FCGR1B"          "FCER1A"          "DYSF"           
##  [7] "ACVR1C"          "COL4A3"          "IGSF11"         
## [10] "RASSF6"          "PARM1"           "PRR16"          
## [13] "ADAMTS2"         "ETV7"            "GPER1"          
## [16] "ENSG00000275791" "MAOB"            "PLPP5"          
## [19] "LACTB2.AS1"      "SERPING1"        "ARHGAP20"       
## [22] "ANKRD22"         "PPFIBP1"         "ITGA7"          
## [25] "IFNG.AS1"        "LINC02384"       "ENSG00000273824"
## [28] "LOC105369827"    "C12orf42"        "SLC41A2"        
## [31] "LINC00540"       "ENSG00000211820" "SLC22A17"       
## [34] "RNU6.8"          "GOLGA6L9"        "MT2A"           
## [37] "ENSG00000267279" "TGM2"            "SIGLEC16"       
## [40] "C19orf18"        "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)

Focusing only on the genes that had a similar behaviour in the two cohorts:

We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The first component of the PCA seems to hold quite a lot of the variability in the data:

plot(pca_all)

The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##               predicted
## true           non_tolerant tolerant
##   non_tolerant           15        3
##   tolerant                5        9
## [1] "prediction error: 25"

It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:

##               R_NELL2               R_NPAS2              R_SRGAP1 
##             0.5010166             0.3934506             0.3710944 
##           R_LINC01259               R_SPON1     R_ENSG00000261685 
##             0.3603916             0.3570765             0.3307919 
##                R_C1QB             R_ADAMTS2 subst_ENSG00000222179 
##             0.3223403             0.2895340             0.2733015 
##              R_AMIGO1                R_C1QA                R_C1QC 
##             0.2287906             0.2229916             0.2093663 
##         subst_ADAMTS2              R_RASSF6             R_RETREG1 
##             0.2062619             0.1989067             0.1974573 
##     R_ENSG00000280123               R_KLHL3     R_ENSG00000242861 
##             0.1847488             0.1839481             0.1814908 
##               R_RAB15              R_DCBLD2 
##             0.1763075             0.1659366

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## character(0)
## [1] "Genes that were commonly selected in recipients and couples: "
##  [1] "C1QC"            "ENSG00000233008" "GBP1P1"         
##  [4] "FCGR1B"          "FCER1A"          "DYSF"           
##  [7] "COL4A3"          "IGSF11"          "RASSF6"         
## [10] "PARM1"           "PRR16"           "ADAMTS2"        
## [13] "ETV7"            "GPER1"           "ENSG00000275791"
## [16] "MAOB"            "PLPP5"           "LACTB2.AS1"     
## [19] "SERPING1"        "ARHGAP20"        "ANKRD22"        
## [22] "PPFIBP1"         "ITGA7"           "IFNG.AS1"       
## [25] "LOC105369827"    "C12orf42"        "SLC41A2"        
## [28] "LINC00540"       "ENSG00000211820" "RNU6.8"         
## [31] "GOLGA6L9"        "MT2A"            "ENSG00000267279"
## [34] "TGM2"            "SIGLEC16"        "C19orf18"       
## [37] "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)

Classifying primary versus secondary tolerant patients based on all the information:

Threshold of 0.95

We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The 2 first components of the PCA seem to hold quite a lot of the variability in the data:

plot(pca_all)

The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.

Along the second principal component, the patients are quite separated according to the expression of the UTY gene in the donors:

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  6                  1
##   secondary_tolerant                2                  5
## [1] "prediction error: 21.4285714285714"

It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:

## subst_ENSG00000255479           D_ZMIZ1.AS1     D_ENSG00000261783 
##            0.29437587            0.28996476            0.17782952 
##    subst_LOC105374344        D_LOC105369536 subst_ENSG00000223922 
##            0.15575841            0.14874872            0.13870779 
## subst_ENSG00000263482           subst_PSAT1                D_AREG 
##            0.12886952            0.12206696            0.11150459 
## subst_ENSG00000233435          subst_CDC25A subst_ENSG00000279948 
##            0.10119828            0.10093540            0.09494413 
##         subst_CCDC183     D_ENSG00000259797 subst_ENSG00000269982 
##            0.09345951            0.08971540            0.08750222 
## subst_ENSG00000279801    subst_LOC101927811       subst_ARHGEF34P 
##            0.08727173            0.08714081            0.08182631 
##            subst_SCIN     D_ENSG00000254750 
##            0.08161524            0.07453462

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "ENSG00000269982" "AREG"            "LOC105369536"    "ENSG00000267257"
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "ENSG00000251876"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)

Threshold of 0.90

We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The 2 first components of the PCA seem to hold quite a lot of the variability in the data:

plot(pca_all)

The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  5                  2
##   secondary_tolerant                2                  5
## [1] "prediction error: 28.5714285714286"

It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:

## subst_ENSG00000255479           D_ZMIZ1.AS1 subst_ENSG00000202392 
##            0.14724524            0.12218429            0.09254095 
##    subst_LOC105374344     D_ENSG00000261783 subst_ENSG00000254750 
##            0.08470952            0.07893238            0.06268571 
## subst_ENSG00000260855           subst_PSAT1     D_ENSG00000203819 
##            0.06216730            0.06181905            0.06101905 
##          subst_CDC25A subst_ENSG00000223922                 D_LAT 
##            0.05866603            0.05852571            0.05437619 
##     D_ENSG00000279722                D_AREG subst_ENSG00000279948 
##            0.05325714            0.05286175            0.05262857 
##     D_ENSG00000260452        D_LOC105369536            subst_TCF7 
##            0.05185714            0.05068095            0.04820762 
##     D_ENSG00000259797 subst_ENSG00000263482 
##            0.04786587            0.04554730

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
## [1] "LRRC6"
## [1] "Genes that were commonly selected in donors and couples: "
##  [1] "ENSG00000228140" "ENSG00000187536" "ENSG00000270557"
##  [4] "ENSG00000269982" "ENSG00000281100" "AREG"           
##  [7] "SAP30L.AS1"      "C6orf223"        "ENSG00000219409"
## [10] "ENSG00000219395" "MIR374B"         "RPA4"           
## [13] "RRS1.AS1"        "LOC105375624"    "LINC00964"      
## [16] "ENSG00000237711" "ENSG00000230225" "ENSG00000227388"
## [19] "ENSG00000235641" "ENSG00000255479" "ENSG00000254750"
## [22] "ENSG00000278768" "LOC105369536"    "ENSG00000211869"
## [25] "ENSG00000211884" "ENSG00000211887" "BCL2A1"         
## [28] "LAT"             "ENSG00000235672" "ENSG00000267257"
## [31] "LINC01727"       "ENSG00000277938" "ENSG00000266910"
## [34] "SIK1B"
## [1] "Genes that were commonly selected in recipients and couples: "
##  [1] "ITGB6"           "CDC25A"          "GIMAP5"         
##  [4] "ENSG00000138161" "ENSG00000257246" "ENSG00000211846"
##  [7] "PCLAF"           "TOP2A"           "ENSG00000251876"
## [10] "ENSG00000267838"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)

Focusing only on the genes that had the same behaviour in the two cohorts:

We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:

We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:

The 2 first components of the PCA seem to hold quite a lot of the variability in the data:

plot(pca_all)

The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.

We can also generate a tSNE on the same data:

We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:

##                     predicted
## true                 primary_tolerant secondary_tolerant
##   primary_tolerant                  4                  3
##   secondary_tolerant                2                  5
## [1] "prediction error: 35.7142857142857"

It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:

##          subst_CDC25A     D_ENSG00000224830              R_CDC25A 
##             0.3419204             0.2407414             0.2299922 
##           D_LINC01225           subst_TOP2A     D_ENSG00000260425 
##             0.2209212             0.1981267             0.1895842 
##                R_ASB2             subst_TTK subst_ENSG00000219747 
##             0.1779283             0.1751804             0.1688614 
##            subst_DDTL            subst_ANLN             R_SNORD17 
##             0.1688307             0.1558634             0.1521835 
##              R_TMEM53     D_ENSG00000255438           D_LOC283194 
##             0.1515672             0.1493561             0.1467402 
##             D_AGAP12P     R_ENSG00000210107               D_AQP10 
##             0.1428300             0.1415584             0.1333834 
## subst_ENSG00000228232                R_E2F8 
##             0.1299255             0.1267698

And visualise the top variables in a heatmap:

Finally, we can see which features were commonly selected between the recipients, donors and couples:

## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## character(0)
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "CDC25A" "PCLAF"  "TOP2A"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)